Object hallucination
Large vision-language models (LVLMs) tend to hallucinate nonexistent objects in images, possibly because of strong language priors and spurious co-occurrences in the training data.
Types
- Object hallucination
- Incorrect description
Questions
- How and why does it occur? Can we pin down the mechanism and trace it back to the training data?
- Proposals: statistical pre-training bias (Agarwal et al., Goyal et al.), over-reliance on language priors (Leng et al., Lee et al., Zhibo et al., Han et al., Wu et al.), and biased feature learning (see Liu2024reducing). These explanations apply to both LVLMs and LLMs, but Liu2024reducing argues that we should also consider mechanisms specific to LVLMs.
- False positives vs. false negatives?
- How to mitigate the hallucination?
- Reasoning?
- Surprisingly popular algorithm?
Evaluation
Rohrbach2018object proposed a simple metric, CHAIR (Caption Hallucination Assessment with Image Relevance), to quantify object hallucination. The instance-level variant is the ratio of hallucinated object mentions to all objects mentioned; the sentence-level variant is the ratio of sentences containing at least one hallucinated object to all sentences.
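A minimal sketch of the two variants, assuming object mentions are already extracted from each caption and normalized to the ground-truth vocabulary (the synonym handling in Rohrbach2018object is omitted; function and variable names are mine):

```python
def chair_scores(captions_objects, gt_objects):
    """captions_objects: list of sets, objects mentioned per caption (sentence).
    gt_objects: set of objects actually present in the image."""
    hallucinated = 0        # object mentions not present in the image
    total_mentions = 0      # all object mentions
    hallucinated_sents = 0  # captions with at least one hallucinated object

    for mentioned in captions_objects:
        fake = mentioned - gt_objects
        hallucinated += len(fake)
        total_mentions += len(mentioned)
        if fake:
            hallucinated_sents += 1

    chair_i = hallucinated / max(total_mentions, 1)                # instance-level
    chair_s = hallucinated_sents / max(len(captions_objects), 1)   # sentence-level
    return chair_i, chair_s


# Example: two captions about an image containing only {"dog", "frisbee"}.
print(chair_scores([{"dog", "frisbee"}, {"dog", "cat"}], {"dog", "frisbee"}))
# -> (0.25, 0.5): 1 of 4 mentions hallucinated; 1 of 2 sentences affected.
```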
Li2023evaluating proposed POPE (Polling-based Object Probing Evaluation). The idea is to ask yes-or-no questions about objects in the scene, where the nonexistent objects are sampled by one of three strategies: random, popular, or adversarial. Popular sampling draws from the top-k most frequent objects in the dataset, while adversarial sampling draws the top-k objects ranked by co-occurrence with the ground-truth objects.
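A rough sketch of the three negative-sampling strategies, with my own helper names and data structures (not the official POPE code):

```python
import random

def sample_negatives(gt_objects, vocab, obj_freq, cooccur, k, strategy):
    """gt_objects: objects present in the image.
    vocab: all object names in the dataset.
    obj_freq: dict mapping object -> frequency over the dataset.
    cooccur: dict mapping (a, b) -> co-occurrence count.
    Returns k nonexistent objects to ask 'Is there a <obj> in the image?' about."""
    candidates = [o for o in vocab if o not in gt_objects]

    if strategy == "random":
        return random.sample(candidates, k)
    elif strategy == "popular":
        # most frequent objects in the dataset that are absent from the image
        return sorted(candidates, key=lambda o: -obj_freq.get(o, 0))[:k]
    elif strategy == "adversarial":
        # absent objects that co-occur most often with the ground-truth objects
        score = lambda o: sum(cooccur.get((g, o), 0) for g in gt_objects)
        return sorted(candidates, key=lambda o: -score(o))[:k]
    raise ValueError(f"unknown strategy: {strategy}")
```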
Mitigation
Liu2024reducing studies both why hallucinations in LVLMs arise and how to mitigate them.